61 research outputs found

    Towards using web-crawled data for domain adaptation in statistical machine translation

    This paper reports on ongoing work on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and for exploiting them in testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on two domains, Natural Environment and Labour Legislation, and two language pairs, English–French and English–Greek.

    Web crawling and domain adaptation methods for building English–Greek machine translation systems for the culture/tourism domain

    Technical report on the work carried out by Víctor Manuel Sánchez Cartagena during a stay at the "Athena Research and Innovation Center", while he was employed by the company Prompsit Language Engineering and was an honorary collaborator in the Departamento de Lenguajes y Sistemas Informáticos of the Universidad de Alicante. This paper describes the process we followed to build English–Greek machine translation systems for the tourism/culture domain. We experimented with different data sets and domain adaptation methods for statistical machine translation and also built neural machine translation systems. The in-domain data were obtained by means of the ILSP Focused Crawler. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

    D7.1. Criteria for evaluation of resources, technology and integration.

    This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. As PANACEA aims at producing large-scale resources, evaluation becomes a critical and challenging issue: critical because it is important to assess the quality of the results to be delivered to users, and challenging because we are exploring rather new areas through a technical platform, so some new methodologies will have to be explored and old ones adapted.

    D4.1. Technologies and tools for corpus creation, normalization and annotation

    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC), which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.

    Adquisición automática de recursos para traducción automática en el proyecto Abu-MaTran (Automatic acquisition of resources for machine translation in the Abu-MaTran project)

    This paper provides an overview of the research and development activities carried out within the Abu-MaTran project to alleviate the language resource bottleneck in machine translation. We have developed a range of tools for acquiring the main resources required by the two most popular approaches to machine translation: statistical models (corpora) and rule-based models (dictionaries and rules). All these tools have been released under open-source licenses and have been developed with the aim of being useful for industrial exploitation. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).
